综合虚拟人类及其3D环境之间的自然相互作用对于众多应用程序(例如计算机游戏和AR/VR体验)至关重要。我们的目标是使人类与给定的3D场景进行互动,该场景由高级语义规格控制为动作类别和对象实例,例如“坐在椅子上”。将相互作用语义纳入生成框架中的主要挑战是学习一个共同表示,该表示有效地捕获了异质信息,包括人体的关节,3D对象几何以及相互作用的意图。为了应对这一挑战,我们设计了一种基于变压器的新型生成模型,其中铰接的3D人体表面点和3D对象共同编码在统一的潜在空间中,并且人与物体之间的相互作用语义是通过嵌入的。位置编码。此外,受到人类可以同时与多个对象相互作用的相互作用的组成性质的启发,我们将相互作用语义定义为不同原子动作对象对的组成。我们提出的生成模型自然可以结合不同数量的原子相互作用,从而无需复合相互作用数据即可合成组成的人类习惯相互作用。我们使用交互语义标签和场景实例分割扩展了Prox数据集,以评估我们的方法,并证明我们的方法可以通过语义控制生成现实的人类场景相互作用。我们的感知研究表明,我们合成的虚拟人类可以自然与3D场景相互作用,从而超过现有方法。我们将方法硬币命名,用于与语义控制的组成相互作用合成。代码和数据可在https://github.com/zkf1997/coins上获得。
translated by 谷歌翻译
像有声读物的综合一样,表达性语音综合仍然对样式表示学习和预测仍然具有挑战性。从参考音频或从文本预测样式标签中得出的标签需要大量标记的数据,这是昂贵的,并且难以准确定义和注释。在本文中,我们提出了一个新颖的框架,以一种自我监督的方式从丰富的纯文本中学习样式表示。它利用情感词典,并使用对比度学习和深度聚类。我们进一步将样式表示形式整合为多式变压器TTS中的条件嵌入。通过预测在同一数据集上训练的样式标签,但通过人类注释,我们的方法根据对声音域内和室外测试集的主观评估来改进结果,从而获得了改进的结果。此外,有了隐性的背景感知样式表示,长期综合音频的情感过渡似乎更自然。音频样本可在演示网络上找到。
translated by 谷歌翻译
神经端到端TTS模型的最新进展显示出在常规句子的TTS中表现出高质量的自然合成语音。但是,当TTS中考虑整个段落时,重现相似的高质量,在构建基于段落的TTS模型时需要考虑大量上下文信息。为了减轻培训的困难,我们建议通过考虑跨性别,嵌入式结构在培训中对语言和韵律信息进行建模。三个子模块,包括语言学意识,韵律和句子位置网络。具体而言,要了解嵌入在段落中的信息以及相应的组件句子之间的关系,我们利用语言学意识和韵律感知网络。段落中的信息由编码器捕获,段落中的句子间信息通过多头注意机制学习。段落中的相对句子位置由句子位置网络明确利用。拟议中的TTS模型在女性普通话中录制的讲故事的音频语料库(4.08小时)接受了培训,该模型表明,它可以产生相当自然而良好的语音段落。与基于句子的模型相比,可以更好地预测和渲染的跨句子上下文信息,例如连续句子之间的断裂和韵律变化。在段落文本上进行了测试,其长度与培训数据的典型段落长度相似,比训练数据的典型段落长得多,新模型产生的TTS语音始终优先于主观测试和基于句子的模型和在客观措施中确认。
translated by 谷歌翻译
通过不懈的研究增强了StyleGAN的语义可控性。尽管现有的弱监督方法在沿一个属性操纵样式代码方面很好地奏效,但操纵多个属性的准确性被忽略了。多属性表示很容易在stylegan潜在空间中纠缠,而顺序编辑会导致错误积累。为了解决这些局限性,我们设计了一个动态样式操纵网络(Dystyle),其结构和参数因输入样本而异,以执行非线性和自适应操纵潜在代码,以进行灵活和精确的属性控制。为了有效且稳定地优化障碍网络,我们提出了动态的多属性对比度学习(DMACL)方法:包括动态的多重构造对比度和动态多属性对比损失,同时将各种属性从生成中删除模型的图像和潜在空间。结果,我们的方法表明了沿多个数字和二进制属性的细粒度分离的编辑。与现有样式操纵方法的定性和定量比较验证了我们方法在多属性控制的准确性和身份保存方面的优越性,而不会损害光真相。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Knowledge graphs (KG) have served as the key component of various natural language processing applications. Commonsense knowledge graphs (CKG) are a special type of KG, where entities and relations are composed of free-form text. However, previous works in KG completion and CKG completion suffer from long-tail relations and newly-added relations which do not have many know triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of graph representation learning and few-shot learning, has been proposed to challenge the problem of limited annotated data. In this paper, we comprehensively survey previous attempts on such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. Then we systematically categorize and summarize existing works in terms of the type of KGs and the methods. Finally, we present applications of FKGC models on prediction tasks in different areas and share our thoughts on future research directions of FKGC.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs are with a large number of parameters, which makes these GNNs computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a light-weighted model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborates that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
translated by 谷歌翻译
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading-off model accuracy and constrained resources still need further improvements. This work rethinks the essential unity of efficient Inverted Residual Block in MobileNetv2 and effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance though sharing the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Massive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass \textbf{SoTA} CNN-/Transformer-based models, while trading-off the model accuracy and efficiency well.
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译